NAR Genomics and Bioinformatics
Oxford University Press (OUP)
Preprints posted in the last 90 days, ranked by how well they match NAR Genomics and Bioinformatics's content profile, based on 214 papers previously published here. The average preprint has a 0.11% match score for this journal, so anything above that is already an above-average fit.
Kawato, S.
Motivation: Generating graphical diagrams of microbial and organellar genomes is a common and essential task in bioinformatics. Existing tools often present a trade-off: powerful programming libraries require coding skills, whereas graphical applications require server processing or local installation with complex dependencies. This highlights the need for a tool that offers both programmatic control for batch processing and graphical accessibility for ease of use. Results: To fill this gap, I developed gbdraw, a web application that generates circular and linear genome diagrams from self-contained GenBank or DDBJ files or combinations of GFF3 annotation and FASTA sequence files. Its core functions include visualizing annotated features, plotting GC content/skew tracks, and optionally generating pairwise sequence comparisons for comparative genomics. It is available as both a GUI web application and a command-line utility. Unlike existing web-based tools that require data upload to a remote server, gbdraw operates entirely within the user's web browser. This serverless architecture ensures that sensitive sequence data never leaves the local machine, providing a secure environment for visualizing unpublished genomic data. Availability and Implementation: gbdraw is implemented in Python 3 (version 3.10+) and is freely available under the MIT license. The web app is available at https://gbdraw.app/. Source code and documentation are available at https://github.com/satoshikawato/gbdraw. The local version can be installed from the Bioconda channel using a conda-compatible package manager.
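As an illustration of the GC content/skew tracks mentioned in this abstract, the following is a minimal sketch of the standard sliding-window calculation, with GC skew defined as (G - C) / (G + C). It is not taken from gbdraw; the function name `gc_content_and_skew` and the default `window` and `step` values are assumptions made for the example.

```python
def gc_content_and_skew(seq, window=1000, step=500):
    """Return (start, gc_content, gc_skew) tuples for sliding windows over seq.

    Hypothetical helper for illustration only; not gbdraw's implementation.
    """
    seq = seq.upper()
    results = []
    # If the sequence is shorter than one window, a single shorter window is used.
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        length = len(chunk)
        # GC content: fraction of G and C bases in the window.
        gc_content = (g + c) / length if length else 0.0
        # GC skew: (G - C) / (G + C); defined as 0 when the window has no G or C.
        gc_skew = (g - c) / (g + c) if (g + c) else 0.0
        results.append((start, gc_content, gc_skew))
    return results
```

Each tuple could then be plotted at the window's start coordinate on a circular or linear track.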
Haddox, S.; Mao, Y.; Tajammal, A.; Engel, J.; Lynch, S.; Huang, N.; Raby, K.; Kian, A.; Li, H.
Chimeric RNA molecules, which contain nucleotide sequences originating from multiple genes, are generated by chromosomal rearrangements, transcriptional read-throughs, or trans-splicing between separate parental transcripts. Chimeric RNAs have been functionally validated in both pathological and healthy physiological contexts, indicating the biological significance of chimeric RNA expression. There is, however, currently no standard for computationally quantifying chimeric RNA expression, and only limited benchmarking data are available for the few chimeric RNA detection tools that attempt to measure the abundance of the predicted chimeras. Here, we develop the relative index of chimeric expression, RICE, which is calculated from the expression of chimeric transcripts relative to the respective parental WT transcripts. We evaluate three different methods for generating this measurement from simulated RNA sequencing data with known transcript abundances. Our BLAST-based approach outperforms STAR- and Kallisto-based approaches when considering both accuracy and consistency between simulated data of different read lengths and sequencing depths. We further demonstrate that RICE values can be validated using qPCR and are sensitive to dynamic conditions using siRNA targeting chimeric RNA expression. Finally, we apply our RICE analysis pipeline to clinical prostate cancer data. We quantify over 1200 chimeric RNAs in primary prostate cancer, metastatic prostate cancer, and non-cancer tissue samples from GTEx. Our differential RICE analysis revealed a clustering of prostate cancer tissue samples from three different sequencing cohorts distinct from their associated tissue type noncancer GTEx clusters. Our pipeline is publicly available on GitHub and can be run on a personal laptop, with computational resources and processing time dependent on the number of quantified chimeras.
O'Hanlon, D.; Garcia Busto, S.; Perez Carrasco, R.
Mutual information is a fundamental quantity in information theory that describes the non-linear dependency between two variables, and has numerous applications within bioinformatics and beyond. However, its exploitation is hampered by a trade-off between computational intensity and accuracy. Here we present an adaptive binning approach to computing the pairwise mutual information, optimized for small integer counts such as those observed in single-cell RNA sequencing. By assuming a sampling distribution such as the negative binomial, a χ² test statistic for hypothesis testing can be computed simultaneously via a copula transformation. Using these quantities, we show how gene rewiring of CD4+ naive T-cells during SARS-CoV-2 infection can be studied using a single-cell sequencing dataset of healthy and COVID-19 donors.
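For orientation, pairwise mutual information between two discrete variables can be estimated from their joint frequency table. The sketch below is the plain plug-in estimator that treats each distinct integer count as its own bin. It is a simplification for illustration and does not implement the adaptive binning, negative binomial assumption, or copula-based χ² statistic described in the abstract; the function name `mutual_information` is an assumption.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in bits from paired discrete observations.

    Illustrative only: every distinct value is its own bin (no adaptive binning).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n == 0:
        raise ValueError("at least one observation is required")
    joint = Counter(zip(x, y))  # joint counts n_ab
    px = Counter(x)             # marginal counts n_a
    py = Counter(y)             # marginal counts n_b
    mi = 0.0
    for (a, b), count in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written in terms of counts
        mi += (count / n) * math.log2(count * n / (px[a] * py[b]))
    return mi
```

Two perfectly coupled binary variables give 1 bit, and two independent ones give 0 bits, which is a quick sanity check before applying such an estimator to sparse count data, where plug-in estimates are known to be biased.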
Alchaar, M.; Dogan, B.
Dimensionality reduction for visualization is a fundamental step in single-cell RNA sequencing (scRNA-seq) analysis due to the extremely high dimensionality of gene expression profiles. However, widely used nonlinear embedding techniques such as UMAP and t-SNE can introduce substantial distortions when projecting data into two-dimensional space, potentially altering global organization, local neighborhoods, and distance relationships in ways that may mislead downstream biological interpretation. In this study, we investigate the applicability of Clustering-Based Manifold Approximation and Projection (CBMAP) for the visualization of scRNA-seq data and systematically examine how clustering strategies influence the quality of the resulting embeddings. CBMAP was integrated with several clustering algorithms commonly used in single-cell analysis, including k-means, Leiden, HDBSCAN, Secuer, HGC, and FlowSOM. The resulting embeddings were evaluated using quantitative metrics that measure global, local, and distance-level structure preservation and were compared with widely used dimensionality reduction methods such as UMAP, t-SNE, and PaCMAP across multiple benchmark datasets. Our results demonstrate that the clustering stage plays a critical role in determining the structural fidelity of CBMAP embeddings. Clustering algorithms specifically designed for single-cell transcriptomic data, particularly Secuer, produced more consistent preservation of global relationships between cell populations. Across multiple datasets, CBMAP more faithfully preserved global structural organization and inter-population distance relationships than the compared methods, although local neighborhood preservation was generally weaker than in techniques optimized for local structure. Importantly, CBMAP embeddings retained biologically meaningful relationships in trajectory benchmark datasets. 
When combined with RNA velocity analysis, CBMAP successfully preserved cyclic progenitor states and branching differentiation trajectories, demonstrating compatibility with trajectory-aware visualization. These findings indicate that CBMAP provides a structure-faithful visualization framework for scRNA-seq data and that clustering selection plays a central role in determining embedding quality.
WANG, Z.; Arsuaga, J.
Computational bacteriophage host prediction from genomic sequences remains challenging because host range depends on diverse, rapidly evolving genomic determinants--from receptor-binding proteins to anti-defense systems and downstream infection compatibility--and because the signals available to predictors, including sequence homology, CRISPR spacer matches, nucleotide composition, and mobile genetic elements, are sparse, unevenly distributed across taxa, and constrained by incomplete host annotations. Here, we frame host prediction as an unsupervised retrieval problem. We asked whether embeddings from the pretrained genome language model Evo2 captured a reliable host-range signal without training on phage-host labels. We generated whole-genome embeddings for phages and candidate bacterial hosts with the Evo2-7B model, applied normalization, and ranked hosts by cosine similarity. Using the Virus-Host Database, we selected embedding and fusion choices on a Gram-positive validation cohort and then evaluated the approach on a held-out Gram-negative test cohort to minimize data leakage. We found that Evo2 was strongest at retrieving multiple plausible hosts, with the recorded host in the top 10 for 55.4% of phages. However, it did not maximize species-level top-1 accuracy (19.4% vs. 23.2% for the best baseline). At higher taxonomic ranks, Evo2 captured a coarser host-range signal: top-1 accuracy reached 43.4% at the genus level and 51.6% at the family level. Reciprocal rank fusion of Evo2 with BLASTN, VirHostMatcher, and PHIST improved all retrieval metrics. Top-10 retrieval rose to 58.5% and top-1 accuracy to 26.9%. Stratified analyses by phage genome length, host clade, and host mobile genetic element coverage revealed scenario-dependent performance. Evo2 embeddings excelled for intermediate-length phages and when host mobile element content was low, whereas alignment and k-mer methods dominated when local homology was abundant. 
These results suggest that pretrained genome embeddings complement established alignment- and k-mer/composition-based methods and that context-aware hybrid pipelines may help improve phage host prediction. Author summary: Bacteriophages are viruses that prey on bacteria and play central roles in microbial ecosystems, nutrient cycling, and the spread of antibiotic resistance genes. Knowing which bacterium a phage can infect is important for applications such as phage therapy, where viruses are used to treat bacterial infections, but making this prediction from DNA sequence data alone remains difficult. Existing computational tools each exploit different types of genomic evidence, and none works reliably across all settings. We asked whether an artificial intelligence model trained to read raw DNA--without ever being shown which phages infect which hosts--could contribute a new, complementary signal. We found that this approach was particularly effective at narrowing the field to a short list of candidate hosts and at capturing broad evolutionary relationships between phages and bacteria. When we combined it with established sequence-comparison tools, overall prediction improved beyond what any single method achieved alone. By examining when each method succeeded or failed, we identified biological factors that govern prediction difficulty, offering practical guidance for building more robust prediction systems.
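The two retrieval steps named in this abstract, ranking hosts by cosine similarity and combining methods by reciprocal rank fusion, can be sketched as follows. This is a generic illustration, not the authors' pipeline: the function names, the toy vectors, and the constant `k=60` (a commonly used default for reciprocal rank fusion) are assumptions, and producing the embeddings themselves is out of scope.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_hosts(phage_vec, host_vecs):
    """Return host names ordered from most to least similar to the phage embedding."""
    return sorted(host_vecs, key=lambda h: cosine(phage_vec, host_vecs[h]), reverse=True)


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked host lists: score(h) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, host in enumerate(ranking, start=1):
            scores[host] = scores.get(host, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Toy usage with made-up 2-D "embeddings" and two hypothetical baseline rankings.
hosts = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
embedding_ranking = rank_hosts([1.0, 0.1], hosts)
fused = reciprocal_rank_fusion([embedding_ranking, ["C", "A", "B"], ["C", "B", "A"]])
```

Because fusion uses only ranks, methods with incomparable score scales (alignment bit scores, k-mer distances, embedding similarities) can be combined without calibration.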
Pose-Lagoa, I.; Urda-Garcia, B.; Olvera, N.; Sanchez-Valle, J.; Faner, R.; Valencia, A.; Carbonell-Caballero, J.
Complex and clinically heterogeneous diseases pose significant challenges for gene prioritisation and patient stratification, as relevant genes often show weak or context-specific signals and transcriptomic datasets are limited in size. These limitations hinder the discovery of robust molecular signatures using traditional case-control approaches and motivate computational pipelines capable of capturing molecular diversity. Here, we present an explainable ensemble-based AI pipeline to prioritise disease-relevant genes from transcriptomic data, using Chronic Obstructive Pulmonary Disease (COPD) as a use case. To retain biologically relevant interactors obscured by molecular heterogeneity, the framework integrates data-driven signals with curated COPD-related gene sets, further expanded through network-based prioritisation and supported by molecular interactions. Gene relevance is evaluated via aggregated explainability scores across multiple classifier configurations to ensure robust candidate selection. The final set comprised <8% of evaluated genes, with ~62% arising from network-based expansion, substantially reducing dimensionality while preserving biological heterogeneity. Beyond case-control classification, the approach identified candidate genes and molecular subgroups associated with specific clinical features, capturing patient-level heterogeneity. The prioritised genes recapitulated key disease-related processes, including immune responses and extracellular matrix degradation, and highlighted additional associations like the enrichment of the IL-4 and IL-13 signalling pathway, which is of clinical interest given ongoing biologic developments targeting these axes. Our pipeline outperformed existing methods in discriminating COPD from controls, and the final gene list was validated in independent cohorts.
Implemented as a scalable and reusable R package, this framework facilitates the study of molecular heterogeneity in complex diseases like COPD, supporting advances in diagnosis and precision medicine. Availability and implementation: EBEx code and tutorials can be found at https://iposelag.github.io/EBEx/
Garcia-Ruano, D.; Georges, M.; Mohanty, S. K.; Baaziz, R.; Makova, K. D.; Nikolski, M.; Chalopin, D.
Background: Long non-coding RNAs (lncRNAs) have gained significant attention in recent years, yet distinguishing them from protein-coding transcripts remains challenging. Indeed, many lncRNAs share mRNA-like processing and existing sequence-derived signals do not fully capture the coding/non-coding boundary. Recent GENCODE annotation efforts revealed tens of thousands of novel lncRNA sequences as well as the reclassification of some lncRNAs into the protein-coding class, highlighting the need to better characterize transcript features associated with classification uncertainty and errors. Results: We performed uncertainty-aware benchmarking by retraining and evaluating eight transcript classifiers under a controlled protocol on a label-stable GENCODE v46-v47 subset. Beyond conventional model evaluation metrics, we quantified inter-tool agreement and entropy-based uncertainty to stratify transcripts into consensus, discordant, and consensus-error groups. To expand standard sequence and ORF-derived signals, we incorporated repeat-derived features from mature transcripts and non-B DNA motif features across gene bodies. Although aggregate performance was high, ~45% of transcripts showed inter-tool discordance, particularly among lncRNAs. Feature analyses linked low-uncertainty predictions to strong coding-like signals, whereas high-uncertainty profiles exhibited mixed signatures. Alongside classical predictors in global importance analyses, repeat-derived features appear as main contributors. Conclusions: By combining controlled benchmarking with transcript-level agreement and uncertainty stratification, together with extended feature profiling, we identified patterns associated with classifier disagreement and misclassification. This novel framework provides practical guidance for interpreting predictions, motivating the development of more robust coding/non-coding classifiers, while also shedding light on the sequence properties that distinguish lncRNA sequences.
Szmigiel, A.; Gesteira Costa Filho, I.; Campello, R. J. G. B.
Clustering single-cell RNA-seq (scRNA-seq) data and related protocols remains a major challenge due to high dimensionality, sparsity, and noise. Despite numerous benchmarking studies aiming to identify the most suitable clustering methods, many suffer from methodological flaws that can undermine their conclusions. A major challenge in benchmarking is selecting representative datasets that cover the diversity of scRNA-seq experiments and include laboratory-verified labels for reliable evaluation. Consistent preprocessing of all inputs to benchmarked algorithms is crucial, as it significantly impacts performance. Beyond selecting an algorithm, a thorough exploration of hyperparameters is also essential to assess robustness and identify configurations that maximize performance. We focus on proposing an improved benchmarking framework that addresses common methodological issues in prior studies. We illustrate our proposed methodology in a case study comparing the classic Leiden and Louvain clustering algorithms with extensive hyperparameter exploration on a carefully curated collection of real gold standard datasets. By evaluating clustering performance across different hyperparameter selection scenarios, we show that benchmarking results can be misleading, either overestimating or underestimating performance depending on how the hyperparameter space is explored. In our illustrative case study, benchmarking results do not reveal any practically relevant performance differences between the Louvain and Leiden algorithms. In contrast, we show that overlooked factors such as graph construction and quality functions critically influence clustering outcomes, particularly under suboptimal settings of numerical hyperparameters--the neighborhood size k used for similarity graph construction and the resolution hyperparameter in graph-based clustering algorithms.
While noticeable trends have been observed in terms of how different (dis)similarity functions affect performance, the impact of this choice is limited and, to some extent, overridden by the graph-building approach. Across different graphs, there is a noticeable trade-off between achieving optimal performance with ideally tuned numerical hyperparameters and maintaining robustness under more realistic, unsupervised, and suboptimal settings. All in all, the analysis of our illustrative benchmarking case study offers clear guidance and objective recommendations for practitioners in the field. Most importantly, as the main contribution of this manuscript, our proposed framework sets a foundation for more reliable scRNA-seq clustering evaluation and benchmarking in future studies.
Choudhury, A.; Kitak, T.; Carrillo, B.; Busch, P.; Emons, M.; Gunz, S.; Koderman, M.; Luo, S.; Mallona, I.; Meara, A.; Wissel, D.; Robinson, M. D.
In the past few years, we have seen a veritable surge in single-cell (e.g., RNA sequencing) techniques and datasets, enabling increasingly detailed characterization of cellular heterogeneity across tissues and conditions. This surge in single-cell techniques has been complemented by a large number of analysis frameworks and pipelines, and a large parameter space and researcher degrees of freedom to use them. Many neutral benchmarks have been presented for various computational tasks, but most make design decisions that render them incompatible with each other, e.g., different datasets and metrics, or parameter sets used. In this work, we showcase a recently developed framework, Omnibenchmark, to build reproducible, extensible and standardized method comparisons. This not only facilitates the broad investigation of pipelines used in single-cell data analysis, but also highlights how the process of building benchmarks can be streamlined and unified. We do this as an initial proof-of-principle for an arms-length benchmark that evaluates five single-cell RNA sequencing pipelines (filtering to normalization to dimensionality reduction to clustering) on three datasets. This standardization enables benchmarks to be easily extended in several directions, including broader parameter sweeps, comparisons across software versions and architectures, isolation of pipeline steps, and integration of additional pipelines, datasets, and metrics.
Casals-Franch, R.; Nonell, L.; Villa-Freixa, J.; Lopez Garcia de Lomana, A.
Reconstructing dynamic immune cell state transitions from single-cell transcriptomic data requires coordinated analytical strategies that capture both phenotypic progression and underlying regulatory programs. This protocol describes a step-by-step computational workflow for analyzing human tumor-infiltrating T cells using the sequential application of dimensionality reduction, pseudotime trajectory inference, regulon activity analysis, and transcription factor-transcription factor network reconstruction. The workflow outlines data preprocessing and quality control, trajectory rooting and parameter selection, branch-specific differential analysis, and the integration of regulon inference to contextualize transcriptional programs along inferred trajectories. Regulon-based TF-TF network reconstruction is used as a downstream interpretive layer to identify regulatory modules associated with distinct cell-state transitions. The protocol is publicly available in the GitHub repository https://github.com/rogercasalsfr/immuno-trajectory-grn-integrative-workflow and emphasizes practical considerations including parameter sensitivity, trajectory robustness, and consistency between phenotypic and regulatory outputs. The protocol supports reproducible analysis and interpretation of immune cell dynamics in human tumor microenvironment studies using single-cell RNA sequencing data.
Appel, J.; Butcher, N.
Protein function prediction remains a central challenge in computational biology due to the extreme sparsity and long-tail distribution of Gene Ontology (GO) [1] annotations. Advances in protein language models enable the extraction of dense, fixed-length representations from amino acid sequences, offering a scalable alternative to hand-picked features such as physicochemical properties. In this work, we evaluate a transformer-based embedding approach using ProtT5-XL combined with classical and modern multi-label classifiers for Gene Ontology prediction in the CAFA-6 setting. Fixed-length embeddings were generated via mean pooling of transformer hidden states and used as input to one-vs-rest logistic regression, gradient-boosted decision trees, and a neural network. Models were evaluated on held-out validation data with a focus on threshold selection, prediction sparsity, and behavior across frequent and rare GO terms. Gradient boosting consistently provided the best balance between predictive performance and stable prediction behavior, motivating its use for ontology-specific predictors across molecular function, biological process, and cellular component annotations. This study highlights practical modeling choices for large-scale protein function prediction using pretrained sequence embeddings and provides an interpretable baseline for future CAFA evaluations.
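Mean pooling, as mentioned in this abstract, collapses a variable-length sequence of per-residue hidden-state vectors into a single fixed-length vector by averaging each dimension. The sketch below uses plain Python lists in place of real ProtT5-XL outputs; the function name `mean_pool` and the optional padding mask are assumptions for illustration, since the abstract does not say how padding or special tokens were handled.

```python
def mean_pool(hidden_states, mask=None):
    """Average per-residue vectors into one fixed-length embedding.

    hidden_states: list of equal-length vectors, one per sequence position.
    mask: optional list of 0/1 flags; positions with 0 (e.g. padding) are skipped.
    Illustrative helper only, not the authors' code.
    """
    if mask is None:
        mask = [1] * len(hidden_states)
    kept = [vec for vec, keep in zip(hidden_states, mask) if keep]
    if not kept:
        raise ValueError("no unmasked positions to pool")
    dim = len(kept[0])
    # Average each embedding dimension over the kept positions.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]
```

The output length depends only on the hidden dimension, not on protein length, which is what allows the pooled vectors to feed classifiers such as logistic regression or gradient-boosted trees.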
Cheng, Y.; Kettlewell, T.; Laidlaw, R. F.; Hardy, O. M.; McCluskey, A.; Otto, T. D.; Somma, D.
Accurate identification of differentially expressed genes (DEGs) in single-cell RNA sequencing (scRNA-seq) data remains challenging. Single-cell-specific statistical models often report large numbers of candidate genes but can exhibit inflated false positive rates, whereas pseudobulk approaches improve false discovery control at the cost of reduced sensitivity. To overcome the noise and bias of other tools, and to give the user more control over the DEG process, we present CellDEEP, which uses a cell aggregation (metacell) approach. This tool provides a framework for flexible selection of pooling strategies and parameterisation for differential expression (DE) analysis. Benchmarking on simulated and real datasets, including COVID-19 and rheumatoid arthritis, shows that CellDEEP often outperforms other methods, consistently reduces false positives compared to single-cell methods and recovers more true positives than pseudobulk methods. Our work shifts the focus from selecting a single "best" method to an approach that reduces cell-level noise while preserving biological signal, together with a transparent validation framework, advancing more reliable differential-expression analysis in single-cell transcriptomics.
Furutani, T.; Ji, H.
While multimodal sequencing technologies are rapidly advancing, most single-cell and spatial datasets still measure only a single modality. Integrative computational methods for separately profiled single-cell RNA-seq (scRNA-seq) and ATAC-seq (scATAC-seq) data typically rely on the assumption that gene expression correlates with the chromatin accessibility of nearby regulatory regions. However, the strength and reliability of these correlations vary substantially across genes, and incorporating low-confidence associations can compromise integration accuracy. Here, we introduce the CLIC (Cross-modality Link Confidence) score, a quantitative measure of the empirical concordance between gene expression and nearby chromatin accessibility, derived from diverse single-cell multiome datasets from the ENCODE project. CLIC scores provide prior confidence estimates for gene-peak associations across modalities. Building on this, we propose a hybrid feature selection strategy that intersects highly variable genes with high-CLIC genes, generating feature sets that better align with the assumptions of cross-modal integration methods. Across diverse publicly available single-cell and spatial datasets, and multiple state-of-the-art integration frameworks, our approach consistently improves the integration of gene expression and chromatin accessibility data, enhancing both robustness and biological interpretability.
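The hybrid feature selection described in this abstract amounts to intersecting two gene sets. A minimal sketch, assuming CLIC scores are available as a gene-to-score mapping and that "high-CLIC" means at or above some cutoff: the function name `hybrid_features`, the cutoff `min_clic=0.5`, and the toy scores are hypothetical, and the abstract does not state the score's scale or how the threshold is chosen.

```python
def hybrid_features(hvgs, clic_scores, min_clic=0.5):
    """Keep highly variable genes whose CLIC score meets the cutoff.

    hvgs: list of highly variable gene names (order is preserved).
    clic_scores: dict mapping gene name to a CLIC score.
    min_clic: hypothetical cutoff; genes without a score are dropped.
    """
    return [g for g in hvgs if clic_scores.get(g, float("-inf")) >= min_clic]


# Toy usage with made-up gene names and scores.
selected = hybrid_features(
    ["GENE1", "GENE2", "GENE3", "GENE4"],
    {"GENE1": 0.9, "GENE2": 0.2, "GENE3": 0.5},
)
```

Preserving the input order means any variability-based ranking of the genes carries over to the selected set.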
Wolfram-Schauerte, M.; Trust, C.; Waffenschmidt, N.; Nieselt, K.
Time-resolved transcriptomic profiling has been used to study phage-host interactions for more than a decade. However, the resulting datasets are not readily accessible for custom re-analysis, and resources are lacking that provide standardized processing, storage, and analysis of transcriptomes from phage infections. Here, we present the PhageExpressionAtlas, the first bioinformatics resource for storing time-resolved dual RNA-sequencing data from phage infections. This data was processed uniformly using a custom analysis pipeline and is presented for interactive exploration through visualisation. The PhageExpressionAtlas currently hosts 42 datasets from 23 studies. Using the PhageExpressionAtlas, we replicate key findings from original publications and extend hypothesis testing across multiple phage-host systems. By systematically querying and analyzing the underlying database, we evaluate approaches to phage gene classification and show that uncharacterized phage genes are expressed across all infection phases. Moreover, we provide a comprehensive view of the expression dynamics of anti-phage defenses as well as host- and phage-encoded anti-defense systems in the infection context, indicating unique and conserved patterns of transcriptional regulation underlying bacterial anti-phage immunity and phage counter-strategies. Together, the PhageExpressionAtlas is a unifying resource that democratizes transcriptomics-driven analyses of phage-host interactions and supports integrative cross-study assessment.
Forcier, T.; Cheng, E.; Tam, O. H.; Wunderlich, C.; Castilla-Vallmanya, L.; Jones, J. L.; Quaegebeur, A.; Barker, R. A.; Jakobsson, J.; Gale Hammell, M.
Transposable elements (TEs) are mobile genetic sequences that can generate new copies of themselves via insertional mutations. These viral-like sequences comprise nearly half the human genome and are present in most genome-wide sequencing assays. While only a small fraction of genomic TEs have retained their ability to transpose, TE sequences are often transcribed from their own promoters or as part of larger gene transcripts. Accurately assessing TE expression from each individual genomic TE locus remains an open problem in the field, due to the highly repetitive nature of these multi-copy sequences. These issues are compounded in single-cell and single-nucleus transcriptome experiments, where additional complications arise due to sparse read coverage and unprocessed mRNA introns. Here we present our tool for single-cell TE and gene expression analysis, TEsingle. Using synthetic datasets, we show the problems that arise when not properly accounting for intron retention events, failing to address uncertainty in alignment scoring, and failing to make use of unique molecular identifiers for transcript resolution. Addressing these challenges has enabled an accurate TE analysis suite that simultaneously tracks gene expression as well as locus-specific resolution of expressed TEs. We showcase the performance of TEsingle using single-nucleus profiles from substantia nigra (SN) tissues of Parkinson's disease (PD) patients. We find examples of young and intact TEs that mark dopaminergic (DA) neurons as well as many young TEs from the LINE and ERV families that are elevated in PD neurons and glia. These results demonstrate that TE expression is highly cell-type and cellular-state specific and elevated in particular subsets of neurons, astrocytes, and microglia from PD patients.
Abbasi, M.; Ochoa Zermeno, S.; Spendlove, M. D.; Tashi, Z.; Plaisier, C. L.; Bartelle, B. B.
Interpretable representations of gene expression are used to define cellular identities and the molecular programs active within cells, two related, but distinct phenomena. In the case of microglia, a cell type with high transcriptomic, functional, and morphological heterogeneity, the predominant representation of transcriptomic data presumes the adoption of distinct molecular identities, despite a lack of easily separable transcriptional states. Here, we explore alternative transcriptomic representations by comparing two single-cell analysis methods: differential expression analysis for identities and co-expression network analysis for molecular programs. For microglia, co-expression network analysis identifies highly significant functional ontologies not resolved by differential expression analysis. The identified co-expression modules are preserved across transcriptomic datasets and suggest reducible functional programs that activate and modulate depending on context. We conclude that co-expression analysis constitutes a best practice for single cell analysis of an individual cell type and describing microglia function as concurrent molecular programs offers a more parsimonious model of microglia function.
Brandulas Cammarata, A.; Fonseca Costa, S. S.; Rosikiewicz, M.; Roux, J.; Wollbrett, J.; Bastian, F. B.; Robinson-Rechavi, M.
RNA-Seq is a powerful technique to provide quantitative information on gene expression. While many applications focus on measuring expression levels, accurately distinguishing between actively and inactively transcribed genes is equally important for understanding gene function, development, and disease mechanisms. However, setting a biologically meaningful threshold for calling genes expressed is challenging due to variability in noise levels across different protocols, experiments or biological samples. We propose to define this threshold per sample relative to the background level observed in inactive genomic features, inferred by the amount of reads mapped to intergenic regions of the genome, and to call genes expressed if their level of expression is significantly higher than the estimated background noise. This approach can be applied to a single RNA-Seq library as well as to a combination of libraries from the same condition, in model and non-model organisms. We show that our method yields a more accurate prediction of expression state than existing methods, illustrated by consistent expression calls for biological replicates in the same tissue.
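The idea of calling a gene expressed only when its level clearly exceeds the per-sample background observed in intergenic regions can be sketched with an empirical p-value. This is one possible reading, not the authors' method: the abstract does not specify the statistical test, so the empirical tail probability, the `alpha=0.05` cutoff, the function name `call_expressed`, and the toy values are all assumptions.

```python
import bisect


def call_expressed(gene_levels, intergenic_levels, alpha=0.05):
    """Call genes expressed relative to one sample's intergenic background.

    gene_levels: dict of gene -> expression level (e.g. TPM) in the sample.
    intergenic_levels: expression levels of intergenic regions in the same sample.
    A gene is called expressed when the fraction of intergenic regions at or
    above its level (with a +1 correction) is at most alpha. Assumed test only.
    """
    background = sorted(intergenic_levels)
    n = len(background)
    if n == 0:
        raise ValueError("intergenic background is empty")
    calls = {}
    for gene, level in gene_levels.items():
        # Number of background regions with a level >= the gene's level.
        n_at_or_above = n - bisect.bisect_left(background, level)
        p_value = (n_at_or_above + 1) / (n + 1)
        calls[gene] = p_value <= alpha
    return calls


# Toy usage: 90 silent intergenic regions and 9 with low-level signal.
toy_background = [0.0] * 90 + [1.0] * 9
toy_calls = call_expressed({"geneA": 5.0, "geneB": 0.5, "geneC": 0.0}, toy_background)
```

Because the background is estimated per sample, the effective threshold adapts to the noise level of each library rather than relying on a fixed cutoff.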
Vader, L.; Harvey, C. J.; Weber, T.; Hon, L. S.
Accurate gene prediction remains a major bottleneck in fungal genomics, where lineage diversity and alternative splicing challenge existing ab initio methods. Here, we present geneML, a deep learning-based gene prediction tool tailored to fungal genomes. Across nine reference genomes spanning diverse fungal taxa, geneML improved the gene-level F1 score from 64.9 to 67.1 compared to BRAKER3 with protein-based hints, driven by substantially higher recall (69.0 vs. 64.1) at equivalent precision. geneML also remains fast, averaging around 6 minutes per genome on a standard 8-core CPU. A key feature of geneML is its ability to predict alternative transcripts. Evaluated against Fusarium graminearum Iso-Seq control data, it achieves 41.1% transcript recall and 71.1% precision, outperforming AUGUSTUS (33.8% recall, 48.9% precision), one of the few tools that support isoform prediction. The predicted transcript diversity is consistent with experimentally observed fungal alternative splicing patterns. Reannotation of the curated training dataset further suggests improved biological completeness, with geneML recovering 15.3% more genes containing complete PFAM domains than the reference annotation. These results demonstrate that geneML enables faster, more sensitive, and more biologically informative fungal genome annotation. geneML is available as an open-source command-line tool at https://github.com/hexagonbio/geneML.
Key Points
- geneML improves gene prediction accuracy over both classical and recent deep learning-based methods, while substantially improving recall.
- geneML predicts alternative transcripts with higher precision and recall than AUGUSTUS, expanding functional annotation.
- Runtime was reduced 32-fold relative to BRAKER3, enabling efficient high-throughput genome annotation.
- geneML identifies novel genes and recovers missing annotations, especially in under-annotated non-Ascomycete genomes.
Aparicio-Puerta, E.; Baran, A. M.; Ashton, J. M.; Pritchett, E. M.; Gaca, A.; Becker, J.; Halushka, M. K.; Jun, S.-H.; McCall, M. N.
MicroRNAs are short noncoding RNAs that regulate gene expression and are commonly profiled by small RNA sequencing (miRNA-seq). Despite the widespread use of miRNA-seq, datasets are often analyzed with RNA-seq methods such as DESeq2 or edgeR, which do not take into account the specific characteristics of miRNA-seq data. Here, we present a benchmark study of normalization and differential expression approaches using a realistic ground-truth dataset. By mixing mouse RNA from two organs, we generated expression trends while capturing biological and technical variability. Using monotonicity across the dataset and expected fold changes from the mixture design, we assessed normalization and differential expression methods. Normalization benchmarking showed that within-sample scaling, particularly Reads Per Million (RPM), best preserved the expected monotonic trends, outperforming cross-sample methods such as TMM, rlog, and VST. The cross-sample methods sometimes recovered apparent monotonicity among abundant miRNAs, but inspection of individual profiles suggested likely over-correction. Regarding differential expression, edgeR consistently ranked among the best-performing methods across several metrics, including log2 fold-change estimation, with performance comparable to miRNA-seq-specific tools such as miRglmm and NBSR. DESeq2, edgeR-v4, and limma-based approaches tended to systematically underestimate log2 fold changes. Applying a common RPM-based normalization substantially improved the performance of cross-sample methods, highlighting the strong influence of normalization on differential expression analysis. Overall, our findings support within-sample scaling methods such as RPM for normalization, and edgeR, miRglmm, or NBSR for differential expression. The dataset has been made publicly available, providing a valuable resource for objective method comparison and future miRNA-seq software development.
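As a small illustration of the within-sample scaling discussed in this abstract, the sketch below computes RPM from a raw count matrix and applies a simple monotonicity check across samples ordered by mixture proportion. It is a hedged sketch, not the benchmark's code: the helper names `rpm` and `is_monotonic`, the toy counts, and the strict all-or-nothing monotonicity criterion are assumptions made for illustration.

```python
import numpy as np

def rpm(counts):
    """Scale each sample (column) to reads per million mapped reads."""
    counts = np.asarray(counts, dtype=float)
    library_sizes = counts.sum(axis=0, keepdims=True)
    return counts / library_sizes * 1e6

def is_monotonic(values):
    """True if values never increase or never decrease along the series."""
    diffs = np.diff(np.asarray(values, dtype=float))
    return bool(np.all(diffs >= 0) or np.all(diffs <= 0))

# Toy counts (hypothetical): rows are miRNAs, columns are samples ordered
# by mixture proportion; library sizes are 1000, 2000, and 3000 reads.
counts = np.array([[100, 600, 1800],
                   [900, 1400, 1200]])
normalized = rpm(counts)

raw_monotonic = is_monotonic(counts[1])
rpm_monotonic = is_monotonic(normalized[1])
print(raw_monotonic, rpm_monotonic)
```

In this toy case the second miRNA looks non-monotonic in raw counts only because sequencing depth differs between samples; after RPM scaling its decreasing trend across the mixture series becomes visible.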
Zhang, X.
Large language model agents are increasingly used for bioinformatics tasks that require external databases, tool use, and long multi-step retrieval workflows. However, practical evaluation of these systems remains limited, especially for prompts whose target set is both large and biologically heterogeneous. Here, I benchmarked three agent systems on the same difficult retrieval task: downloading coccolithophore calcification-related proteins from UniProt across six mechanistically distinct categories, while producing category-separated FASTA files and supporting evidence. The compared systems were Codex app agents extended with Claude Scientific Skills, Biomni Lab online, and DeerFlow 2 with default skills only. Outputs were normalized at the UniProt accession level and compared category by category using overlap analysis, Venn decomposition, and a heuristic relevance assessment of each subset relative to the benchmark prompt. In a single run across the six shared categories, Codex retrieved 2,118 proteins, DeerFlow 6,255, and Biomni 8,752. Codex showed the best balance between sensitivity and specificity: 92.4% of its proteins fell into subsets labeled high relevance and the remaining 7.6% into medium relevance. DeerFlow was substantially more exhaustive, but 43.8% of its proteins fell into low or low-medium relevance subsets. Biomni produced the largest sets, yet 69.5% of its proteins fell into low or low-medium relevance subsets, mainly due to broad expansion into generic calcium sensors, kinases, transcription factors, and poorly specific domain families. Category-specific analysis showed that Codex was the strongest primary source for inorganic carbon transport, calcium and pH regulation, vesicle trafficking, and signaling, whereas DeerFlow contributed valuable complementary matrix and polysaccharide candidates.
A second run for each system also separated them strongly by repeatability: Codex had the highest within-system stability (mean category Jaccard 0.982; micro-Jaccard 0.974), DeerFlow was intermediate (0.795; 0.571), and Biomni was least stable (0.412; 0.319). These results suggest that for complex protein-family retrieval tasks, agent quality depends less on raw output volume than on prompt decomposition, taxonomic scoping, exact query generation, provenance-rich export artifacts, and repeated-run stability.
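The two repeatability figures reported per system can be illustrated with a short sketch. The abstract does not define them, so the definitions here are assumptions: "mean category Jaccard" is taken as the average of per-category Jaccard indices between two runs, and "micro-Jaccard" as pooled intersection size over pooled union size across categories. The function names and the placeholder accessions (`P1` to `P5`) are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two accession collections (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def run_stability(run1, run2):
    """Return (mean category Jaccard, micro-Jaccard) for two runs.

    Each run maps a category name to a set of UniProt accessions.
    """
    categories = sorted(set(run1) | set(run2))
    pairs = [(set(run1.get(c, ())), set(run2.get(c, ()))) for c in categories]
    # Macro view: average the per-category Jaccard indices.
    mean_category = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
    # Micro view: pool intersections and unions before dividing.
    total_inter = sum(len(a & b) for a, b in pairs)
    total_union = sum(len(a | b) for a, b in pairs)
    micro = total_inter / total_union if total_union else 1.0
    return mean_category, micro

# Toy runs (hypothetical accessions, two categories).
run1 = {"transport": {"P1", "P2", "P3"}, "signaling": {"P4"}}
run2 = {"transport": {"P2", "P3", "P5"}, "signaling": {"P4"}}
mean_category, micro = run_stability(run1, run2)
print(mean_category, micro)
```

Under these definitions the micro value weights large categories more heavily, which is one way a system could show a much lower micro-Jaccard than mean category Jaccard when its largest categories are the least reproducible.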